Abstract


The curse of dimensionality is a common phenomenon which affects analysis of
datasets characterized by large numbers of variables associated with each point.
Problematic scenarios of this type frequently arise in classification algorithms which
are heavily dependent upon distances between points, such as nearest-neighbor and
k-means clustering. Given that contributing variables follow Gaussian distributions,
this research derives the probability distribution that describes the distances
between randomly generated points in n-space. The theoretical results are extended
to examine additional properties of the distribution as the dimension becomes
arbitrarily large. With this distribution of distances between randomly generated points
in arbitrarily large dimensions, one can then determine the significance of distance
measurements between any collection of individual points.


1 Introduction 


In recent years, there has been a significant increase in the size of datasets, many of
which possess hundreds or even thousands of features per observation. This has given rise
to increased observations of the phenomenon known as the “curse of dimensionality.” The
phenomenon was first identified by Bellman in 1957; one manifestation is that the distances
between observations grow with the number of additional dimensions. [3] One of the
side effects of this phenomenon is that spurious results are obtained due to the increase
in variance produced by the introduction of extraneous features (dimensions). These
spurious results arise directly from the influence that the additional distance has on the
underlying metric used to calculate the nearest distance for classification examples in a
training dataset. This effect is especially problematic in statistical learning algorithms
which rely heavily on measures of closeness or similarity between observations.

Interest in the distances between randomly generated points initially arose from
attempts to analyze the effects that additional data features (dimensions) have upon
classification algorithms used for statistical learning with large datasets. Hastie et al.
provide an in-depth analysis of this phenomenon in their text “The Elements of Statistical
Learning”, which is a widely used textbook at the graduate level. [7] The Nearest
Neighbor algorithm is one such common machine learning technique which is known to
be sensitive to the effects of extraneous dimensions because additional features increase
the distances between points and reduce the contrast between meaningful neighbors and
those which may arise purely from statistical noise. Aggarwal, Hinneburg, and Keim
provide a very good discussion of this phenomenon in their paper and term it “relative
contrast”. [1] They observe that as the number of dimensions increases, the distance to
the nearest query point using a nearest neighbor algorithm increases faster than the
difference in distances between the furthest and nearest query points. Their conclusion
is that as the number of dimensions increases, proximity queries become meaningless due
to the lack of proper discrimination regarding what constitutes a suitable neighbor for
classification purposes.
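
The loss of contrast described above can be illustrated numerically. The sketch below is not taken from the cited papers: it assumes standard normal data, uses one common way of expressing relative contrast, (D_max - D_min) / D_min, and the function name `relative_contrast` and its parameters are hypothetical.

```python
import math
import random


def relative_contrast(k, n_points=200, seed=0):
    """Return (D_max - D_min) / D_min for the distances from one query point
    to n_points reference points, all drawn i.i.d. standard normal in k
    dimensions (an assumption made for this illustration)."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(k)]
    dists = []
    for _ in range(n_points):
        point = [rng.gauss(0.0, 1.0) for _ in range(k)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The ratio is expected to shrink as the number of dimensions grows.
    for k in (2, 10, 100, 1000):
        print(k, round(relative_contrast(k), 3))
```

Under these assumptions the ratio is large in two dimensions and small in a thousand, which is the sense in which nearest and furthest neighbors become hard to tell apart.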

While significant research has considered this problem conceptually, the authors are
aware of little research regarding a quantitative approach. Earlier work includes
several successful attempts to determine the distribution of distances between randomly
generated points uniformly distributed within a hypersphere [2, 6, 9], randomly generated
points distributed on the surface of a hypersphere [10], or more general analysis
of random points contained within some compact convex subset of points in Euclidean
space [4]. Solomon provided additional extensive coverage of these concepts for uniformly
distributed points in circles and spheres in his book titled “Geometric Probability.” [11]
Interestingly, all of these studies use a uniform distribution of point density, either
on the surface or in the interior of a sphere, and arose from motivations in the field of
geometric probability. The authors are unaware of any similar research that considers the
distribution of distances between points which themselves are distributed according to
a Standard Normal distribution in k-dimensional space without geometric boundaries.
This is surprising given the frequent application of the Standard Normal distribution
for modeling purposes and the standard practice of normalizing data prior to the
application of statistical learning techniques in order to prevent dominance of one variable
over another. In particular, the normalization of data is highlighted
in several of the key texts involving statistical learning. [8, 12] This paper develops
the distribution which defines the distances between randomly generated Standard Normal
points in k-dimensional space without geometric boundaries, thus providing some
statistical tools that support further refinement of machine learning and data mining
applications.

2 Distribution of Absolute Distance Between Two Gaussian Variables


This paper provides an investigation into the effects of dimensionality on the distribution
of distances between random points which are distributed Standard Normal in each
of their respective dimensions. Most datasets confronted in high-dimensional
problems are standardized in each attribute. [8, 12] Demonstrated below are the
quantitative effects that arise as the number of dimensions of the dataset increases.
To the authors’ knowledge this subject has not been treated from a quantitative aspect
in the literature.

We begin by determining the absolute difference between two random variables (p
and q) which are both drawn from the standardized Gaussian distribution. In this case,
the absolute difference corresponds to the one-dimensional distance between points. Note
that this distance cannot be negative. The Cumulative Distribution Function (CDF) which
provides the distribution of the absolute difference between points p and q, or their
distance, is shown below and diagrammed further in Figure 1.

\[
F(x) = P(|p - q| \le x)
     = \int_{-\infty}^{\infty} \int_{p-x}^{p+x}
       \frac{1}{\sqrt{2\pi}} e^{-p^2/2} \, \frac{1}{\sqrt{2\pi}} e^{-q^2/2} \, dq \, dp
\]

Figure 1: Diagram of zone of integration which covers the probability that the distance
between p and q is less than r

The inner term of the integral is the product of the Gaussian Probability Density
Function (PDF) for each variable under consideration. Evaluating the innermost
integral, we arrive at the expression:

\[
F(x) = \int_{-\infty}^{\infty} \frac{1}{2\sqrt{2\pi}} e^{-p^2/2}
       \left[ \operatorname{erf}\!\left(\frac{p + x}{\sqrt{2}}\right)
            - \operatorname{erf}\!\left(\frac{p - x}{\sqrt{2}}\right) \right] dp
\]


To determine the PDF of the above function, take the derivative of the integral with
respect to x:

\[
f(x) = \frac{dF}{dx} = \frac{1}{\sqrt{\pi}} \, e^{-x^2/4}, \qquad x \ge 0
\]

Thus, the above expression is the PDF for the distribution of the absolute distance
between two standardized Gaussian variables in one dimension. A plot of this
distribution is shown in Figure 2.

Figure 2: Probability Density of the Absolute Difference between Points


In this section we have derived the PDF for the absolute difference (i.e. the
non-negative distance) between any two random variables with a standard Gaussian
distribution. In the following sections, this concept is extended to an arbitrary
number k of dimensions.
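
The one-dimensional result can be checked by simulation. The sketch below is an illustration rather than part of the derivation: it relies on the fact that the difference of two independent standard normal variables is normal with variance 2, so the density of the absolute difference is exp(-x^2/4)/sqrt(pi) and its integral from 0 to x is erf(x/2). The function names are hypothetical.

```python
import math
import random


def pdf_abs_diff(x):
    """PDF of |p - q| for independent standard normal p and q (x >= 0)."""
    return math.exp(-x * x / 4.0) / math.sqrt(math.pi)


def cdf_abs_diff(x):
    """CDF of |p - q|; integrating the PDF above from 0 to x gives erf(x / 2)."""
    return math.erf(x / 2.0)


def empirical_cdf(x, n_samples=100_000, seed=1):
    """Fraction of simulated pairs (p, q) whose absolute difference is <= x."""
    rng = random.Random(seed)
    hits = sum(
        abs(rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)) <= x
        for _ in range(n_samples)
    )
    return hits / n_samples


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0, 3.0):
        print(x, round(cdf_abs_diff(x), 4), round(empirical_cdf(x), 4))
```

The simulated and closed-form values should agree to within ordinary sampling error.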

3 Extension of Distance to k Dimensions of Gaussian Variables


The previous section derived the PDF for the distribution of the absolute difference in
standard normal random variables in one dimension. We will now extend the concept
to an arbitrary number k of dimensions and use the Euclidean metric to determine the
distance between points. Let two arbitrary points of interest in k dimensions be further
defined as $\Psi = (\psi_1, \psi_2, \ldots, \psi_k)$ and
$\Gamma = (\gamma_1, \gamma_2, \ldots, \gamma_k)$. We are interested in the Euclidean
distance between the two points, which is defined as:

\[
\left[ \sum_{i=1}^{k} (\psi_i - \gamma_i)^2 \right]^{1/2}
\]


We generalize to k dimensions now and begin by constructing the CDF which measures
the probability that two points are separated by a distance of at most R. We do this by
making a change of coordinates from the standard Euclidean coordinates $x_i$
to those of a k-dimensional hypersphere and then concern ourselves only with the first
$2^k$-tant, where all of the values in the Euclidean coordinate system are positive.
This is possible since each Euclidean coordinate is the absolute difference between the
corresponding components of the two points; it is therefore non-negative and follows the
one-dimensional distribution derived in the previous section. By approaching the problem
in this manner, we need only concern ourselves with integration of all angles in the
hypersphere from 0 to $\pi/2$, which we will designate using $\phi_i$ for each of the
respective angular dimensions. The radial coordinate r is integrated from 0 to R, so that
the CDF gives the probability that the distance is less than or equal to R.



Figure 3: Plot of region of iterated integration in the first $2^k$-tant

We begin by making the following replacements of hyperspherical coordinates, as
shown by Cohl in [5].

\[
\begin{aligned}
x_1 &= r \cos(\phi_1) \\
x_2 &= r \sin(\phi_1) \cos(\phi_2) \\
x_3 &= r \sin(\phi_1) \sin(\phi_2) \cos(\phi_3) \\
    &\;\;\vdots \\
x_{k-1} &= r \sin(\phi_1) \cdots \sin(\phi_{k-2}) \cos(\phi_{k-1}) \\
x_k &= r \sin(\phi_1) \cdots \sin(\phi_{k-2}) \sin(\phi_{k-1})
\end{aligned}
\]
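
The coordinate replacements above can be written as a short routine. This is a minimal sketch with a hypothetical function name; it maps a radius and k - 1 angles to k Cartesian coordinates and makes it easy to confirm that the resulting point lies at distance r from the origin and, for angles between 0 and pi/2, in the first orthant.

```python
import math


def hyperspherical_to_cartesian(r, phis):
    """Map (r, phi_1, ..., phi_{k-1}) to (x_1, ..., x_k)."""
    coords = []
    sin_prod = 1.0  # running product sin(phi_1) * ... * sin(phi_{i-1})
    for phi in phis:
        coords.append(r * sin_prod * math.cos(phi))
        sin_prod *= math.sin(phi)
    coords.append(r * sin_prod)  # the last coordinate carries only sines
    return coords


if __name__ == "__main__":
    x = hyperspherical_to_cartesian(2.0, [0.3, 1.1, 0.7])
    print(x)
    print(math.sqrt(sum(c * c for c in x)))  # recovers the radius
```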

After converting to hyperspherical coordinates, the spherical volume
element $d^kV$ is given by

\[
d^kV = r^{k-1} \sin^{k-2}(\phi_1) \sin^{k-3}(\phi_2) \cdots \sin(\phi_{k-2})
       \, dr \, d\phi_1 \, d\phi_2 \cdots d\phi_{k-1}.
\]

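The iterated integral in the first $2^k$-tant which defines the CDF for the probability
that the distance between two random Gaussian points is no greater than R in
k-dimensional space is then given by

\[
\begin{aligned}
F(R,k) = \int_0^{\pi/2} \!\cdots \int_0^{\pi/2} \!\int_0^{R}
 & \left( \frac{e^{-\frac{1}{4}(r\cos(\phi_1))^2}}{\sqrt{\pi}} \right)
   \left( \frac{e^{-\frac{1}{4}(r\sin(\phi_1)\cos(\phi_2))^2}}{\sqrt{\pi}} \right)
   \left( \frac{e^{-\frac{1}{4}(r\sin(\phi_1)\sin(\phi_2)\cos(\phi_3))^2}}{\sqrt{\pi}} \right) \cdots \\
 & \left( \frac{e^{-\frac{1}{4}(r\sin(\phi_1)\cdots\sin(\phi_{k-2})\cos(\phi_{k-1}))^2}}{\sqrt{\pi}} \right)
   \left( \frac{e^{-\frac{1}{4}(r\sin(\phi_1)\cdots\sin(\phi_{k-2})\sin(\phi_{k-1}))^2}}{\sqrt{\pi}} \right) \\
 & \; r^{k-1} \sin^{k-2}(\phi_1) \sin^{k-3}(\phi_2) \cdots \sin(\phi_{k-2})
   \, dr \, d\phi_1 \, d\phi_2 \cdots d\phi_{k-1}.
\end{aligned}
\]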


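However, since we have completed the change of coordinates to the first $2^k$-tant of a
hypersphere, the above expansion simplifies to

\[
F(R,k) = \left( \frac{1}{\sqrt{\pi}} \right)^{k}
\int_0^{\pi/2} \!\cdots \int_0^{\pi/2} \!\int_0^{R}
e^{-r^2/4} \, r^{k-1} \sin^{k-2}(\phi_1) \sin^{k-3}(\phi_2) \cdots \sin(\phi_{k-2})
\, dr \, d\phi_1 \, d\phi_2 \cdots d\phi_{k-1}.
\]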
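
The step behind this simplification can be spelled out as a worked equation. It uses only the fact that the squares of the hyperspherical coordinates sum to $r^2$, so the product of the k one-dimensional densities depends on the radius alone:

```latex
\[
\prod_{i=1}^{k} \frac{1}{\sqrt{\pi}} \, e^{-x_i^2/4}
  = \left( \frac{1}{\sqrt{\pi}} \right)^{k}
    \exp\!\left( -\frac{1}{4} \sum_{i=1}^{k} x_i^2 \right)
  = \left( \frac{1}{\sqrt{\pi}} \right)^{k} e^{-r^2/4},
\qquad \text{since } \sum_{i=1}^{k} x_i^2 = r^2 .
\]
```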

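Rearranging the terms so that we are only integrating with respect to the variable
in question at each iterated integral, we have

\[
F(R,k) = \left( \frac{1}{\sqrt{\pi}} \right)^{k}
\int_0^{\pi/2} \left( \int_0^{\pi/2} \sin(\phi_{k-2}) \cdots
\left( \int_0^{\pi/2} \sin^{k-3}(\phi_2)
\left( \int_0^{\pi/2} \sin^{k-2}(\phi_1)
\left( \int_0^{R} e^{-r^2/4} \, r^{k-1} \, dr \right)
d\phi_1 \right) d\phi_2 \right) \cdots \, d\phi_{k-2} \right) d\phi_{k-1}.
\]
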
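Beginning with the innermost integral, we make use of the solution to the integral

\[
\int_0^{R} e^{-r^2/4} \, r^{k-1} \, dr
  = 2^{k-1} \left[ \Gamma\!\left( \frac{k}{2} \right)
    - R^{k} \left( R^2 \right)^{-k/2} \, \Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right) \right],
\]

where $\Gamma(a, z)$ is the upper incomplete gamma function. Since k is a positive
integer and R is positive, the factor $R^{k} (R^2)^{-k/2}$ equals 1 and the above result
simplifies to

\[
\int_0^{R} e^{-r^2/4} \, r^{k-1} \, dr
  = 2^{k-1} \left[ \Gamma\!\left( \frac{k}{2} \right)
    - \Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right) \right].
\]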

After integration, the result is a constant with respect to all subsequent iterated 
integrals. Thus, we can move it to the outside of the iterated integrals as shown below. 


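\[
F(R,k) = \left( \frac{1}{\sqrt{\pi}} \right)^{k} 2^{k-1}
\left[ \Gamma\!\left( \frac{k}{2} \right) - \Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right) \right]
\int_0^{\pi/2} \left( \int_0^{\pi/2} \sin(\phi_{k-2}) \cdots
\left( \int_0^{\pi/2} \sin^{k-3}(\phi_2)
\left( \int_0^{\pi/2} \sin^{k-2}(\phi_1) \, d\phi_1 \right)
d\phi_2 \right) \cdots \, d\phi_{k-2} \right) d\phi_{k-1}.
\]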


Now we have a series of iterated integrals over decreasing powers of the sine function,
for all integer powers from k - 2 down to 1. We now use the following identity, which
holds for any m whose real part is greater than -1 and can be verified using a
computer algebra system.

\[
\int_0^{\pi/2} \sin^{m}(\theta) \, d\theta
  = \frac{\sqrt{\pi} \, \Gamma\!\left( \frac{m+1}{2} \right)}{2 \, \Gamma\!\left( \frac{m}{2} + 1 \right)}
\]
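
For readers without a computer algebra system, the identity can also be checked numerically. The sketch below compares the closed form against a composite Simpson approximation; the function names and the number of intervals are choices made for this illustration.

```python
import math


def sine_power_integral_closed(m):
    """Closed form: sqrt(pi) * Gamma((m + 1) / 2) / (2 * Gamma(m / 2 + 1))."""
    return math.sqrt(math.pi) * math.gamma((m + 1) / 2) / (2 * math.gamma(m / 2 + 1))


def sine_power_integral_numeric(m, n_intervals=2000):
    """Composite Simpson approximation of the integral of sin(t)**m on [0, pi/2].

    n_intervals must be even for Simpson's rule.
    """
    h = (math.pi / 2) / n_intervals
    total = math.sin(0.0) ** m + math.sin(math.pi / 2) ** m
    for i in range(1, n_intervals):
        weight = 4 if i % 2 else 2
        total += weight * math.sin(i * h) ** m
    return total * h / 3


if __name__ == "__main__":
    for m in range(1, 7):
        print(m, sine_power_integral_closed(m), sine_power_integral_numeric(m))
```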


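Since all powers m in the integral are strictly positive integers, each integral in our
iterated integral expression can be replaced with the closed form expression above. The
result is strictly a constant depending only on the power to which the sine function is
raised in the integrand. Since integration occurs for every power of sine with
$1 \le m \le k-2$, the results of each of these integrals can be gathered as constants
outside of the integral and formed into a product extending from 1 to k - 2, which
leaves us with the following result

\[
F(R,k) = \left( \frac{1}{\sqrt{\pi}} \right)^{k} 2^{k-1}
\left[ \Gamma\!\left( \frac{k}{2} \right) - \Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right) \right]
\left( \prod_{m=1}^{k-2}
  \frac{\sqrt{\pi} \, \Gamma\!\left( \frac{m+1}{2} \right)}{2 \, \Gamma\!\left( \frac{m}{2} + 1 \right)} \right)
\int_0^{\pi/2} d\phi_{k-1}.
\]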


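The final integral found on the far right evaluates to $\pi/2$, which leaves us the
simplified CDF shown below.

\[
F(R,k) = \left( \frac{1}{\sqrt{\pi}} \right)^{k} 2^{k-1}
\left[ \Gamma\!\left( \frac{k}{2} \right) - \Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right) \right]
\left( \prod_{m=1}^{k-2}
  \frac{\sqrt{\pi} \, \Gamma\!\left( \frac{m+1}{2} \right)}{2 \, \Gamma\!\left( \frac{m}{2} + 1 \right)} \right)
\frac{\pi}{2}
\]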



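After further simplification, in which the powers of 2 and of $\pi$ cancel, this
reduces to

\[
F(R,k) = \left[ \Gamma\!\left( \frac{k}{2} \right) - \Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right) \right]
\prod_{m=1}^{k-2} \frac{\Gamma\!\left( \frac{m+1}{2} \right)}{\Gamma\!\left( \frac{m}{2} + 1 \right)}.
\]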


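Closer examination of the iterated product by expansion of terms and cancellation
produces the following identity:

\[
\prod_{m=1}^{k-2} \frac{\Gamma\!\left( \frac{m+1}{2} \right)}{\Gamma\!\left( \frac{m}{2} + 1 \right)}
  = \frac{\Gamma(1)}{\Gamma\!\left( \frac{k}{2} \right)}
  = \frac{1}{\Gamma\!\left( \frac{k}{2} \right)}.
\]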
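
The cancellation is a telescoping one: the denominator of the term for m equals the numerator of the term for m + 1, leaving only the first numerator and the last denominator. A short numerical check, with a hypothetical function name, is sketched below.

```python
import math


def gamma_ratio_product(k):
    """Product over m = 1 .. k-2 of Gamma((m + 1) / 2) / Gamma(m / 2 + 1).

    For k = 2 the product is empty and equals 1.
    """
    product = 1.0
    for m in range(1, k - 1):
        product *= math.gamma((m + 1) / 2) / math.gamma(m / 2 + 1)
    return product


if __name__ == "__main__":
    for k in (2, 3, 5, 10, 20):
        # The two columns should agree: the product telescopes to 1 / Gamma(k / 2).
        print(k, gamma_ratio_product(k), 1.0 / math.gamma(k / 2))
```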

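Substituting this result back into the CDF, we have

\[
F(R,k) = \frac{\Gamma\!\left( \frac{k}{2} \right) - \Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right)}{\Gamma\!\left( \frac{k}{2} \right)}
       = 1 - \frac{\Gamma\!\left( \frac{k}{2}, \frac{R^2}{4} \right)}{\Gamma\!\left( \frac{k}{2} \right)}.
\]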



Taking the derivative with respect to R yields the PDF below.

\[
f(R,k) = \frac{2^{1-k} \, R^{k-1} \, e^{-R^2/4}}{\Gamma\!\left( \frac{k}{2} \right)}
\]

This reduces to $f(R,k) = \frac{1}{\sqrt{2}} \, \chi\!\left( k, \frac{R}{\sqrt{2}} \right)$,
where $\chi(k, \cdot)$ denotes the PDF of the chi distribution with k degrees of freedom
and R is the distance between two random points which have each of their k components
distributed Gaussian. This density is non-negative, and a quick integration over all
possible values of R from 0 to infinity equals 1, thereby confirming the validity of
the PDF.
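
The normalization claim and the PDF itself can be checked numerically. The sketch below is an illustration under stated assumptions: it evaluates the PDF through logarithms for numerical stability, obtains the CDF by Simpson integration rather than through an incomplete gamma routine, and compares it with simulated distances. All function names and the integration settings are hypothetical.

```python
import math
import random


def distance_pdf(r, k):
    """f(R, k) = 2**(1 - k) * R**(k - 1) * exp(-R**2 / 4) / Gamma(k / 2)."""
    if r < 0:
        return 0.0
    if r == 0:
        # Only the one-dimensional density is non-zero at the origin.
        return 1.0 / math.sqrt(math.pi) if k == 1 else 0.0
    log_f = ((1 - k) * math.log(2.0) + (k - 1) * math.log(r)
             - r * r / 4.0 - math.lgamma(k / 2.0))
    return math.exp(log_f)


def simpson(func, a, b, n_intervals=4000):
    """Composite Simpson rule on [a, b]; n_intervals must be even."""
    h = (b - a) / n_intervals
    total = func(a) + func(b)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


def distance_cdf(r, k):
    """P(distance <= r), obtained by integrating the PDF from 0 to r."""
    return simpson(lambda t: distance_pdf(t, k), 0.0, r)


def sample_distance(k, rng):
    """Euclidean distance between two simulated standard normal points."""
    return math.sqrt(sum((rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)) ** 2
                         for _ in range(k)))


if __name__ == "__main__":
    rng = random.Random(3)
    k, r = 5, 3.0
    draws = [sample_distance(k, rng) for _ in range(20_000)]
    print(distance_cdf(r, k), sum(d <= r for d in draws) / len(draws))
```

For k = 2 the CDF has the elementary form 1 - exp(-R^2/4), which gives a convenient exact value to compare against.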


A plot of the PDF is shown in Figure 4 for a variety of dimensions (k), where
k = 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 100. This plot is interesting in that as
$k \to \infty$, the distribution resembles a shifted normal. Furthermore, due to the
underlying calculation of Euclidean distance, the shift has no asymptotic limit: the
distribution continues to move to the right as k grows.



Figure 4: Series of overlaid plots of PDFs for various values of k (axes: f(k, x) and
number of dimensions k)


In summary, we evaluated the Euclidean distance between the absolute differences
of two k-dimensional random variables by examining them as the equivalent of a set
of points in the first $2^k$-tant of a hypersphere. Through this conversion, we are able
to evaluate an iterated integral to a simple closed form expression for the CDF which
reflects the distance between two such points. Further differentiation with respect to the
radial bound R produces the PDF of the distribution for the distance between two
points of interest in k-dimensional space.

4 Calculation of Raw and Central Moments


Any time a new distribution is derived, it is natural to discuss its moments. We
will do this by evaluating the following integral to calculate the n-th raw moment of the
distribution.

\[
\int_0^{\infty} R^{n} \left( \frac{2^{1-k} \, e^{-R^2/4} \, R^{k-1}}{\Gamma\!\left( \frac{k}{2} \right)} \right) dR
\]


Factoring out constants, we are left with

\[
\frac{2^{1-k}}{\Gamma\!\left( \frac{k}{2} \right)} \int_0^{\infty} e^{-R^2/4} \, R^{n+k-1} \, dR.
\]


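The following identity holds, provided that k + n is strictly greater than zero. This is
always true in this analysis given that the dimensionality has $k \ge 1$ and the raw
moment has $n \ge 1$.

\[
\int_0^{\infty} e^{-R^2/4} \, R^{n+k-1} \, dR = 2^{n+k-1} \, \Gamma\!\left( \frac{k+n}{2} \right)
\]

This simplifies to the expression below for the raw moment in k dimensions.

\[
\int_0^{\infty} R^{n} \left( \frac{2^{1-k} \, e^{-R^2/4} \, R^{k-1}}{\Gamma\!\left( \frac{k}{2} \right)} \right) dR
  = \frac{2^{n} \, \Gamma\!\left( \frac{k+n}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)}
\]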

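Therefore, the first four raw moments for k dimensions are shown below. The second and
fourth simplify to polynomials in k because $\Gamma(z+1) = z\,\Gamma(z)$.

\[
\begin{aligned}
\int_0^{\infty} R \left( \frac{2^{1-k} e^{-R^2/4} R^{k-1}}{\Gamma\!\left( \frac{k}{2} \right)} \right) dR
  &= \frac{2 \, \Gamma\!\left( \frac{k+1}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)}
  && \text{(first raw moment)} \\
\int_0^{\infty} R^{2} \left( \frac{2^{1-k} e^{-R^2/4} R^{k-1}}{\Gamma\!\left( \frac{k}{2} \right)} \right) dR
  &= \frac{4 \, \Gamma\!\left( \frac{k+2}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)} = 2k
  && \text{(second raw moment)} \\
\int_0^{\infty} R^{3} \left( \frac{2^{1-k} e^{-R^2/4} R^{k-1}}{\Gamma\!\left( \frac{k}{2} \right)} \right) dR
  &= \frac{8 \, \Gamma\!\left( \frac{k+3}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)}
  && \text{(third raw moment)} \\
\int_0^{\infty} R^{4} \left( \frac{2^{1-k} e^{-R^2/4} R^{k-1}}{\Gamma\!\left( \frac{k}{2} \right)} \right) dR
  &= \frac{16 \, \Gamma\!\left( \frac{k+4}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)} = 4k(k+2)
  && \text{(fourth raw moment)}
\end{aligned}
\]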
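
The raw-moment formula is straightforward to evaluate and to cross-check against direct numerical integration. The sketch below is illustrative: the function names, the upper integration limit of 60, and the use of Simpson's rule are assumptions of this example, and the closed form is evaluated through `math.lgamma` so that it remains stable for large k.

```python
import math


def raw_moment(n, k):
    """E[R**n] = 2**n * Gamma((k + n) / 2) / Gamma(k / 2), via log-gamma."""
    return math.exp(n * math.log(2.0)
                    + math.lgamma((k + n) / 2.0) - math.lgamma(k / 2.0))


def raw_moment_numeric(n, k, upper=60.0, n_intervals=8000):
    """Simpson approximation of the integral of R**n * f(R, k) on [0, upper]."""
    def integrand(r):
        if r == 0.0:
            # R**(n + k - 1) vanishes at 0 whenever n + k > 1.
            return 0.0 if n + k > 1 else 1.0 / math.sqrt(math.pi)
        log_val = ((1 - k) * math.log(2.0) + (n + k - 1) * math.log(r)
                   - r * r / 4.0 - math.lgamma(k / 2.0))
        return math.exp(log_val)

    h = upper / n_intervals
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0


if __name__ == "__main__":
    k = 7
    for n in (1, 2, 3, 4):
        print(n, raw_moment(n, k), raw_moment_numeric(n, k))
```

With k = 7, the second and fourth values should match 2k = 14 and 4k(k + 2) = 252.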


Calculation of the central moments begins with the first raw moment identified above,
which serves as the mean $\mu$ in the n-th central moment integral

\[
\int_0^{\infty} (x - \mu)^{n} f(x) \, dx.
\]

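Substituting values for $\mu$ and f(x) and restricting the lower bound of the integral to
reflect strictly non-negative values, we have the following expression.

\[
\mu_n = \int_0^{\infty}
\left( R - \frac{2 \, \Gamma\!\left( \frac{k+1}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)} \right)^{n}
\frac{2^{1-k} \, e^{-R^2/4} \, R^{k-1}}{\Gamma\!\left( \frac{k}{2} \right)} \, dR
\]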


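Unfortunately, determining a general closed form expression for the n-th central moment
is non-trivial, but this paper will consider the second, third, and fourth central moments.
We begin by determining the second central moment $\mu_2$.

\[
\mu_2 = 2k - \frac{4 \, \Gamma\!\left( \frac{k+1}{2} \right)^{2}}{\Gamma\!\left( \frac{k}{2} \right)^{2}}
\]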

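Now determine the third central moment $\mu_3$.

\[
\mu_3 = \frac{8 \, \Gamma\!\left( \frac{k+3}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)}
      - \frac{12k \, \Gamma\!\left( \frac{k+1}{2} \right)}{\Gamma\!\left( \frac{k}{2} \right)}
      + \frac{16 \, \Gamma\!\left( \frac{k+1}{2} \right)^{3}}{\Gamma\!\left( \frac{k}{2} \right)^{3}}
\]

Finally, determine the fourth central moment $\mu_4$.

\[
\mu_4 = 4k(k+2)
      + \frac{\pi \, 4^{3-k} (k-2) \, \Gamma(k)^{2} - 48 \, \Gamma\!\left( \frac{k+1}{2} \right)^{4}}{\Gamma\!\left( \frac{k}{2} \right)^{4}}
\]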

The skewness for this distribution is as indicated below using $\mu_i$, where $\mu_i$ is
the i-th central moment.

\[
\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}
\]


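The kurtosis for this distribution is provided below using the expression for $\beta_2$.

\[
\beta_2 = \frac{\mu_4}{\mu_2^{2}}
= \frac{4k(k+2) + \dfrac{\pi \, 4^{3-k} (k-2) \, \Gamma(k)^{2} - 48 \, \Gamma\!\left( \frac{k+1}{2} \right)^{4}}{\Gamma\!\left( \frac{k}{2} \right)^{4}}}
       {\left( 2k - \dfrac{4 \, \Gamma\!\left( \frac{k+1}{2} \right)^{2}}{\Gamma\!\left( \frac{k}{2} \right)^{2}} \right)^{2}}
\]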
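
The central moments, skewness, and kurtosis can all be computed from the raw-moment formula by expanding $(R - \mu)^n$ with the binomial theorem. The sketch below takes that route instead of evaluating the central-moment integrals directly; the function names are hypothetical.

```python
import math


def raw_moment(n, k):
    """E[R**n] = 2**n * Gamma((k + n) / 2) / Gamma(k / 2); equals 1 for n = 0."""
    return math.exp(n * math.log(2.0)
                    + math.lgamma((k + n) / 2.0) - math.lgamma(k / 2.0))


def central_moment(n, k):
    """n-th central moment, from the binomial expansion of (R - mu)**n."""
    mu = raw_moment(1, k)
    return sum(math.comb(n, j) * raw_moment(j, k) * (-mu) ** (n - j)
               for j in range(n + 1))


def skewness(k):
    """mu_3 / mu_2**(3/2)."""
    return central_moment(3, k) / central_moment(2, k) ** 1.5


def kurtosis(k):
    """mu_4 / mu_2**2 (not excess kurtosis)."""
    return central_moment(4, k) / central_moment(2, k) ** 2


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 50, 200):
        print(k, round(central_moment(2, k), 4),
              round(skewness(k), 4), round(kurtosis(k), 4))
```

As k grows, the skewness moves toward 0 and the kurtosis toward 3, which is consistent with the distribution resembling a shifted normal in high dimensions.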


All of the above identities hold provided that Re(k) > 0, which holds for the problem
under consideration in this paper. Thus, the closed form expression for the raw
moment of the distance distribution in k dimensions is shown. While a closed form
expression for the general central moment is not derived, the second, third, and fourth
central moments are directly calculated.

5 Discussion and Further Work


In this paper we have developed the PDF for the distribution of Euclidean distance
between any two points with k standard normal features or dimensions. Our methodology
utilized a change of coordinates and allowed for the derivation of the CDF for the
distribution of distances between points in k dimensions. Subsequent differentiation
provided the PDF, and we were pleasantly surprised by the elegant closed-form expression.


The distribution of distances between random points is especially important to data
mining applications since several important classification algorithms rely on the use of
distance to nearest known examples to determine the classification of unknown instances.
Frequently it is the case that a significant portion of features or dimensions in a given
dataset are completely irrelevant to the problem, or are themselves noise. The inclusion
of variables which provide no additional accuracy actually serves to increase the distance
between points, and the effect of the added distance is distributed as outlined in this
paper, and is dependent upon the number of extraneous dimensions. Additionally,
spurious results may be given erroneous meaning when such results emerge from the
inclusion of a large number of randomly distributed features in the data under
consideration.
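
The effect of extraneous dimensions on distance can be illustrated with a small simulation. The sketch below assumes that every feature, relevant or not, is standard normal, so that the second raw moment 2k applies; the function name and sample sizes are hypothetical choices for this example.

```python
import random


def mean_squared_distance(k, n_pairs=5000, seed=7):
    """Average squared Euclidean distance between pairs of k-dimensional
    points whose coordinates are i.i.d. standard normal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        total += sum((rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)) ** 2
                     for _ in range(k))
    return total / n_pairs


if __name__ == "__main__":
    base = 5  # number of features assumed relevant in this illustration
    for noise in (0, 10, 50):
        k = base + noise
        # The simulated mean should be close to the theoretical value 2k.
        print(noise, round(mean_squared_distance(k), 2), 2 * k)
```

Each added noise feature contributes 2 to the expected squared distance under these assumptions, regardless of whether it carries any information about the class.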

Since we have derived the distribution of distances between neighboring points, future 
research will evaluate the use of this distribution in the reduction of such spurious 
associations and the probability that these associations would arise from the number 
of underlying features. 